Using R to Examine Precipitation Data

Over the Fall of 2021, Dr. Mullens and I studied precipitation extremes in the northeastern U.S. As a statistics major, I was happy to accept this research project because it allowed me to use what I have been learning in my courses in a serious setting. Researching the climate, in particular, is such a worthwhile use of statistics, since the results of climate research by organizations such as the IPCC are used to drive public policy and determine what we can expect our world to look like going forward.

At the beginning of the semester, we obtained the projections of over 30 climate models from the Coupled Model Intercomparison Project version 5 (found at https://www.climdex.org/). Our focus was on examining precipitation in the Northeastern US, including New England, most of New York, and part of the lower Canadian border (41-47.3 N and 66-80W). We only considered land here, that is, we ignored any points over the ocean. We used a measure from the Climdex called Rx5day, which is the maximum amount of precipitation over a 5-day interval of each month. Our data contained this metric for every month from as early as 1850 to 2011 (found here).

In this study, we use reanalysis data as the ‘ground truth’, which we evaluate the climate model projections in the historical period (for more on reanalysis – see here. We used the ERA40 and NCEP-NCAR reanalysis, available from 1957-2011 and 1948-2012 respectively.

First we compared to models to each other, to see how much precipitation each model projected. Then we compared the models to the historically observed climate data (called reanalysis) over the same time period.

An interactive graph below allows you to explore the data, too! It is a density plot, measuring how likely certain outcomes are to occur. A model with a tall distribution (like any of the blue, reanalysis data) means that it projects maximum 5-day precipitation mostly around one amount, and one with a wide distribution (like one of the models starting with “IPSL”) projects various levels of precipitation. Here are some ways to get the most out of the graph:

(Graph constructed using R and the tidyverse and plotly packages.)

Next, we investigated how often the Rx5day metric changed drastically by model. This inquiry would tell us how volatile each model is. We chose 1961-1990 as our base period, which all our models and the reanalysis had in common, for an accurate comparison.

We called a 2-month sequence in which Rx5day went from above the 75th percentile of all Rx5day amounts to below the 25th percentile, or vice-versa, “whiplash events.” For the name, we visualized the heavy precipitation suddenly switching direction (from high to low or vice-versa), like the jerk of a whip, over the two months.

The graph below shows the number of whiplash events per model over our base period. The most interesting piece of information here is that the reanalysis models (in red), which contain historical precipitation data, don’t have the same number of whiplashes at all: 23 vs 43. This difference suggests that the number of whiplash events alone is not a good way to compare models to each other. We expected the two reanalyses in red to have about the same number of whiplashes. The fact that they didn’t means the two data collections differ in their magnitudes of Rx5day and its distribution, resulting in one being less volatile than the other.

But maybe we should count not only the number of whiplashes, but their magnitudes. That is, we take the difference between the Rx5day values for the two months of each whiplash, then make every number positive. We found these numbers and plotted them too (in fact, a majority of my research boiled down to coming up with questions, then figuring out how to plot the answers in R). Notice that here as well, when we order the number of whiplashes (left and in red), the average magnitude of the whiplashes (right and in blue) fluctuates randomly. Therefore, we cannot use the number of whiplash events in a model to predict the magnitude of those whiplash events.

We were not done looking at whiplashes, however. We still had not looked at data projecting into the future. Each model contained multiple precipitation projections for the next several decades, differing depending on greenhouse gas emissions scenarios. We call these scenarios, “Representative Concentration Pathways” (RCP), followed by a number denoting how much greenhouse gases are in the atmosphere. RCP2.6 is the “ideal scenario”, limiting global warming to below 2 degrees Celsius as in the Paris Agreement, and RCP8.5 is the “worst case scenario”, that is, if nobody changes their behavior. For this research, we used RCP4.5 and RCP8.5.

We decided to compare the number of whiplash events and their average magnitudes depending on scenario, and relate them back to the model projections in the past. Just as we used a 30-year baseline for the historical projections (1961-1990), we used the period 2031-2060 as our baseline for the future.

(Note that some of our models lacked future scenarios or a historical projection in some cases, which is why some rows of the graph have fewer than three data points.)

We can take away some observations from this graph:

Now we have insights from the whiplash concept! That is, the models forecast a future with a more stark contrast between rainy months and dry months, in the Northeast United States. Even though most of the models have a historical period mean that is greater than the reanalysis mean, we can look at the relative changes between each model’s historical period and future periods, which will most likely resemble the changes in the future.

Conclusions and investigations like these are why I love statistics. Can we predict one thing from another? We find out using statistical methods, or sometimes, a glance at a plot is all we need to see a pattern or a lack of association.

These last plots are meant for exploration: they show the density plots for all our models, within 30-year time constraints. The historical data in the graph only uses the years 1961-1990, and the future uses the 30-year period 2031-2060. One can find out using this interactive plot the character of each model. All the guidelines for navigating the first plot apply: double-click a model in the legend on the right to isolate it, then hover over a plot to see its mean and variance. This version makes it easier to compare how each model changes depending on future scenario.

Hopefully, this article has given you a clue as to what researching the climate as an undergraduate in statistics is like. Thank you for reading!